day 10 - ChristmaSSE KeyGen
I ran this program but it never finished... maybe my computer is too slow. Maybe yours is faster?
https://advent2019.s3.amazonaws.com/326c15f8884fcc13d18a60e2fb933b0e35060efa8a44214e06d589e4e235fe34
Recon
We're given a binary that takes no input and does a very long-running calculation.
With a bit of static analysis, we notice that it's a nested loop with the inner loop doing 1000 iterations for every iteration of the outer loop. The outer loop is supposed to do 0x112210f47de98115
(1234567890123456789
in decimal) iterations. Clearly, this will never be finished in time for the CTF. To put it into perspective, if every outer loop iteration takes 1 nanosecond, it would take ~39 years to finish the computation.
Obviously, we need to figure out what it's trying to do to optimize it. After the outer loop is done, it performs an XOR operation with a block of data given the symbol "flag" and outputs that. The necessary components are the SSE registers xmm3
, xmm2
, xmm1
and xmm0
.
We resorted to a bit of dynamic analysis at first to see if we can find a pattern, we created a binary with a different iteration count and created this gdb
script to trace SSE registers state (gdb -x bla.gdb ./slow.elf
)
define p128
printf "%016lx%016lx", $arg0.v4_int64[1], $arg0.v4_int64[0]
end
br *0x4005a3
commands
printf "xmm0: "
p128 $ymm0
printf " -- xmm1: "
p128 $ymm1
printf "\n"
printf "xmm2: "
p128 $ymm2
printf " -- xmm3: "
p128 $ymm3
printf "\n"
c
end
xmm0: ****1112****1516****0304****0708 -- xmm1: ****2728****3132****1920****2324
xmm2: ****4344****4748****3536****3940 -- xmm3: ****5960****6364****5152****5556
0x601090 <flag>: 0x0000000000000000 0x0000000000000000
0x6010a0 <flag+16>: 0x0000000000000000 0x0000000000000000
(gdb) c
Continuing.
Breakpoint 2, 0x000000000040077f in main ()
PRINT FLAG
0x601090 <flag>: 0x5960636451525556 0x4344474835363940
0x6010a0 <flag+16>: 0x2728313219202324 0x1112151603040708
Unfortunately, we couldn't identify a pattern quickly by doing that, so we resorted to reverse engineering the application. The structure of the application is as shown below in pseudocode.
Outer loop
Constants
d01 = 0x00000001000000010000000100000001
d02 = 0x00000002000000020000000200000002
d03 = 0x00000003000000030000000300000003
d04 = 0x00000004000000040000000400000004
d11 = 0x00000005000000050000000500000005
d12 = 0x00000006000000060000000600000006
d13 = 0x00000007000000070000000700000007
d14 = 0x00000008000000080000000800000008
d21 = 0x00000009000000090000000900000009
d22 = 0x0000000a0000000a0000000a0000000a
d23 = 0x0000000b0000000b0000000b0000000b
d24 = 0x0000000c0000000c0000000c0000000c
d31 = 0x0000000d0000000d0000000d0000000d
d32 = 0x0000000e0000000e0000000e0000000e
d33 = 0x0000000f0000000f0000000f0000000f
d34 = 0x00000010000000100000001000000010
Loop structure
A register suffixed with b means the value of the register at the beginning of the loop.
rep x data[n]
is repeat the \(x\)th 32-bit dword, where \(4 \leq x \leq 1\), of the 128-bit data block where \(0 \leq n \leq 5\), and data[0]
is at address 0x00601040
. The index for the dwords for the repetition is 1-based, where 4 is the highest 32-bit dword and 1 is the lowest (little endian).
xmm3 = d04 * xmm0b + d03 * xmm1b + d02 * xmm2b + d01 * xmm3b
xmm2 = d14 * xmm0b + d13 * xmm1b + d12 * xmm2b + d11 * xmm3b
xmm1 = d24 * xmm0b + d23 * xmm1b + d22 * xmm2b + d21 * xmm3b
xmm0 = d34 * xmm0b + d33 * xmm1b + d32 * xmm2b + d31 * xmm3b
Inner loop
for i in range(1000):
cmp = 0x0096433d
for reg in [xmm3, xmm2, xmm1, xmm0]:
for j, dword in reg:
if cmp <= dword:
reg[j] = reg[j] - 1
Reverse engineered python implementation
After quite a bit of clean up, we realized that it's a simple matrix exponentiation of an initial matrix modulo 0x0096433d
. Here's the reverse engineered version of the binary, simplifying all the operations down to the basics.
#!/usr/bin/python
import struct
import numpy as np
flag = "fc14eb09bcaee7474fe37cc152a5028e8971c88d9623016d71405aeafd461d23".decode('hex')
n = 0
xmm0 = np.array([0, 0, 0, 1], dtype=np.int32)
xmm1 = np.array([0, 0, 1, 0], dtype=np.int32)
xmm2 = np.array([0, 1, 0, 0], dtype=np.int32)
xmm3 = np.array([1, 0, 0, 0], dtype=np.int32)
mod = np.repeat(np.array(0x0096433d, dtype=np.int32), 4)
def pack128(v):
return v.tobytes()
while n < 0x112210f47de98115:
xmm0b = xmm0
xmm1b = xmm1
xmm2b = xmm2
xmm3b = xmm3
xmm3 = (4 * xmm0b + 3 * xmm1b + 2 * xmm2b + 1 * xmm3b) % mod
xmm2 = (8 * xmm0b + 7 * xmm1b + 6 * xmm2b + 5 * xmm3b) % mod
xmm1 = (12 * xmm0b + 11 * xmm1b + 10 * xmm2b + 9 * xmm3b) % mod
xmm0 = (16 * xmm0b + 15 * xmm1b + 14 * xmm2b + 13 * xmm3b) % mod
n = n + 1
xmm_enc = pack128(xmm3)
xmm_enc += pack128(xmm2)
xmm_enc += pack128(xmm1)
xmm_enc += pack128(xmm0)
flag_o = "AOTW{"
for idx in xrange(0, 32, 2):
flag_o += chr(ord(flag[idx + 0]) ^ ord(xmm_enc[(idx * 2) + 0]))
flag_o += chr(ord(flag[idx + 1]) ^ ord(xmm_enc[(idx * 2) + 1]))
flag_o += "}"
print(flag_o)
Solution
Knowing it's a matrix exponentiation operation modulo a certain integer, we can speed the process up with modular exponentiation. Here is the solution in sagemath:
#!/usr/bin/env python
# coding: utf-8
# In[1]:
matrix = matrix(Integers(0x0096433d), 4, 4, range(1, 17))
# Output from known working impl (at iters = 0x100000):
# ```
# [5523814 9753103 4134779 8364068]
# [8834623 3475204 7963398 2603979]
# [2297819 7044918 1944404 6691503]
# [5608628 767019 5773023 931414]
# ```
# In[7]:
matrix ** (0x100000)
# In[8]:
flag = "fc14eb09bcaee7474fe37cc152a5028e8971c88d9623016d71405aeafd461d23".decode('hex')
xmm3, xmm2, xmm1, xmm0 = (matrix ** 0x112210f47de98115)
# In[14]:
import numpy as np
xmm_enc = np.array(xmm3, dtype=np.int32).tobytes()
xmm_enc += np.array(xmm2, dtype=np.int32).tobytes()
xmm_enc += np.array(xmm1, dtype=np.int32).tobytes()
xmm_enc += np.array(xmm0, dtype=np.int32).tobytes()
flag_o = "AOTW{"
for idx in xrange(0, 32, 2):
flag_o += chr(ord(flag[idx + 0]) ^^ ord(xmm_enc[(idx * 2) + 0]))
flag_o += chr(ord(flag[idx + 1]) ^^ ord(xmm_enc[(idx * 2) + 1]))
flag_o += "}"
flag_o
Flag
AOTW{M4tr1x_3xp0n3nti4t1on_5728391723}